Abstract

One of the central problems in the classification of individual test sequences
(e.g. in genetic analysis) is that of checking the similarity of a sample test
sequence against a very long training sequence that contains specific
features that are sought in the test sequence. It should be noted that the storage
of long training sequences is considered to be a serious bottleneck in
next-generation sequencing for genome analysis.

Some popular classification algorithms adopt a probabilistic approach, by
assuming that the sequences are realizations of some variable-length Markov
process or a hidden Markov process (HMM), thus enabling the imbedding of the
training data onto a variable-length suffix-tree, the size of which is usually linear
in N, the length of the test sequence.

Despite the fact that it is not assumed here that the sequences are realizations
of probabilistic processes (an assumption that does not seem to be fully justified
when dealing with biological data), it is demonstrated that any classifier may,
without any loss in generality, always be based on a universal compaction of the
training data that is contained in a (long) individual training sequence, onto a
suffix-tree with no more than O(N) leaves, regardless of how long the training
sequence is, at only a vanishing increase in the misclassification error rate.

Keywords: universal classification, universal compression, bio-informatics.
Introduction:



One of the central problems in the classification of individual test sequences (e.g. in genetic
analysis) is that of checking the similarity of a sample test sequence against a training
sequence (which in many cases is much longer than the test sequence) that contains specific
features that are sought in the test sequence. Upon observing an individual training
N'-sequence $Y = (y_1, y_2, \ldots, y_{N'})$, a classifier searches for some typical features that are
imbedded in the training sequence and also appear in a test N-sequence $X = (x_1, x_2, \ldots, x_N)$.
It should be noted that "data-storage is considered to be a serious bottleneck in the next
generation sequencing for genome analysis" (quoting Prof. Stuart M. Brown, NYU Genome
Center).

Some popular classification algorithms adopt a probabilistic approach, by assuming that
the sequences are realizations of some variable-length Markov process or a hidden Markov
process (HMM), thus enabling the imbedding of the training data in a variable-length
suffix-tree, the size of which is usually linear in N, the length of the test sequence [e.g. Bejerano
et al., 2001; Giancarlo P. et al., 2009; Reinert G. et al., 2009; Ulitsky I. et al., 2009].

Despite the fact that the probabilistic approach is not theoretically justified, it apparently 
led to good empirical classification results. 

In the case of data compression of long sequences, an alternative to the probabilistic 
approach was established: The stream of data to be compressed is assumed to be a non- 
probabilistic individual sequence. 

The assumption that the compression is carried out via a universal Turing machine led to 
the notion of Kolmogorov complexity, which is the best asymptotic compression ratio that 
may be achieved for the individual sequence by any computer. 
A more conceptually restricted, but practical, approach was obtained by replacing the
universal Turing machine model by a finite-state machine (FSM) model or a finite block-length
compression model (LZ [Ziv, J. and Lempel, A. 1977; Ziv, J. 2008], CTW [Willems,
F.M.J. et al. 1995; Weinberger, M. et al. 1995]). This led to an associated suffix-tree data base
with O(N') leaves, where all leaves have about the same empirical probability of appearance,
and where N' is the length of the sequence to be compressed.

It has been demonstrated that the common probabilistic modeling approach for prediction
tasks may be replaced by an individual sequence approach as well. In this case too, organizing
the data base in the form of a variable-length suffix-tree (context-tree) with leaves that
have about the same empirical probability of appearance of suffixes led to efficient on-line
prediction [Ziv, J. and Merhav, N. 2007]. A similar approach is adopted here, studying
the performance of universal classification of an individual test sequence relative to a long
individual training sequence.

Despite the fact that it is not assumed that the sequences are realizations of a probabilistic
process (an assumption that does not seem to be fully justified when dealing with
biological data), it is demonstrated that optimal classification may be based on the compaction
of the individual, long training sequence onto a suffix-tree with no more than O(N)
leaves (rather than O(N') leaves, as in the data compression case), regardless of how long
the training sequence is, at the cost of only a negligible increase in the misclassification error
rate.

Furthermore, the generation algorithm of the suffix-tree from the training sequence is
universal, since it does not depend on the specific features that are imbedded in the training
sequence, thus yielding a formal justification for the efficiency of classifiers that are associated
with the compaction of the training data onto a suffix-tree data-base with an O(N) storage
complexity, without relying on any a-priori probabilistic assumptions.
Features and Similarity:



A feature is a distinct function of some substring $Y_i(j) = (y_i, y_{i-1}, \ldots, y_{i-j+1})$; $j \in [1, i-1]$
of Y. Not every substring of Y necessarily supports a feature, and no feature-supporting
substring is a prefix of another longer feature-supporting substring.

Let F(Y) be a set of features that are imbedded in a (long) training sequence Y and are 
sought in the test sequence X. 

The discussion is restricted to the case where, given a collection of observed features of
Y in X, the classifier has to decide whether X is similar enough to Y and should be declared
acceptable, or, given two test sequences X and X', which one is more similar
(relative to some similarity measure) to Y. Two tasks are typically considered:

Filtering: 

Upon observing a collection of features of Y that appear in X, decide whether the test sequence
X is similar enough to the training sequence Y to be declared acceptable.

Clearly, a test sequence that contains no element of F(Y) should be declared by an
effective classifier to be not acceptable relative to Y.

Sorting: 

Sorting of test sequences X that passed the filtering stage, by their degree of "similarity" to
the training sequence Y.

Example: 

Y = ABACDCBEDEDE
Features: A, BA, C, CD
X = AABDADAD
Features in X: A, A, BA, A, A
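
The example can be reproduced with a short scan. The following Python sketch rests on two assumptions that are inferred from the example rather than stated in the text: a feature is matched at position i by reading the sequence backwards from its i-th letter, and when several features match at the same position the longest one is reported. The function name `features_in` is hypothetical.

```python
def features_in(x: str, features: list[str]) -> list[str]:
    """List the features found in x, scanning positions left to right.

    Assumption: a feature f matches at position i when it equals the
    backward-read substring x[i], x[i-1], ..., x[i-len(f)+1].
    Assumption: if several features match at i, the longest is reported.
    """
    found = []
    for i in range(len(x)):
        backward = x[i::-1]  # x[i], x[i-1], ..., x[0]
        matches = [f for f in features if backward.startswith(f)]
        if matches:
            found.append(max(matches, key=len))
    return found


features = ["A", "BA", "C", "CD"]
print(features_in("AABDADAD", features))  # ['A', 'A', 'BA', 'A', 'A']
```

Under these assumptions the scan of X yields the same list as the example: BA is found at the third letter of X, where B is preceded by A.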

Consider, for example, a version of the Average Common Length (ACL) classification
algorithm [Ulitsky et al., 2006]:

Let the set of features F(Y) consist of f(Y) distinct substrings $Y_i(j)$; $1 \le j \le L_{max}$ in
Y, that are leaves of a full tree, each being a prefix of some "trimmed" suffix
$Y_i(j)$; $1 \le j \le i-1$; $1 \le i \le N'-1$ of Y.

Let L(Y) denote the empirical average length of the elements of F(Y) when sliding along Y.

Let L(X|Y) be the empirical average length of "trimmed" suffixes $X_i(j)$; $j \le L_{max}$ in X
that are elements of F(Y).

Declare that X is similar to Y iff:

$$D(X|Y) = L(X|Y) - L(Y) > T \qquad (1)$$

where T is a preset threshold. Here D(X|Y) is a measure of the similarity of X to Y.
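
A minimal Python sketch of this decision rule follows. It makes several assumptions that are not confirmed by the text: the statistic in (1) is taken to be the difference L(X|Y) - L(Y), features are matched by reading backwards from each position, the longest matching feature is used at each position, and the average is taken over the positions where some feature matches. All function names are hypothetical.

```python
def average_feature_length(s: str, features: list[str]) -> float:
    """Empirical average length of the features met when sliding along s.

    Assumptions: backward matching from each position, longest match wins,
    and the average runs over positions where some feature matches.
    """
    lengths = []
    for i in range(len(s)):
        backward = s[i::-1]  # s[i], s[i-1], ..., s[0]
        hits = [len(f) for f in features if backward.startswith(f)]
        if hits:
            lengths.append(max(hits))
    return sum(lengths) / len(lengths) if lengths else 0.0


def acl_statistic(x: str, y: str, features: list[str]) -> float:
    """Assumed form of D(X|Y): L(X|Y) - L(Y)."""
    return average_feature_length(x, features) - average_feature_length(y, features)


def is_similar(x: str, y: str, features: list[str], threshold: float) -> bool:
    """Declare x similar to y when the statistic exceeds the preset threshold T."""
    return acl_statistic(x, y, features) > threshold


F = ["A", "BA", "C", "CD"]
print(average_feature_length("AABDADAD", F))      # L(X|Y) for the example
print(average_feature_length("ABACDCBEDEDE", F))  # L(Y) for the example
```

For the sequences of the example above, the matched lengths are 1, 1, 2, 1, 1 in X and 1, 2, 1, 1, 2 in Y under these assumptions.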

Observe that, given a particular set F(Y), all its features may be mapped onto a
suffix-tree with no more than f(Y) leaves. However, we are looking for universal compaction
schemes, where the set of features F(Y) in Y is not known.

In the following it will be demonstrated that, despite the fact that the particular set of
features F(Y) is not known (aside from its cardinality f(Y)), it is possible to universally
compact Y onto a suffix-tree with no more than O(N) << N' leaves with only a negligible
effect on the efficiency of any classifier.
Definition: Classification Error



A test sequence X is misclassified if it is similar to Y, but is wrongly declared by the classifier
to be not similar to Y, and therefore not acceptable.

Traditionally, one assumes that the test sequences are realizations of some probabilistic
source (e.g. a hidden Markov process) and evaluates the performance of a given classifier via
the corresponding probability of a misclassification event. In our case, no such probabilistic
model exists, and the performance of a classifier is evaluated via the average error rate of
misclassification in a sliding window of length N along the training sequence Y.

Let the fraction of substrings of Y of length N, among the N' - N such substrings in Y,
that are similar to Y be q; 0 < q < 1.

An efficient training sequence Y for a given classifier C(Y,X) is expected to have high 
values of q. 

Given an individual training sequence Y and a classifier C(Y,X), the empirical classification
efficiency may be evaluated by the following measure:

Definition: Classification Error rate relative to Y.

The error rate $P_C(Y, X)$ of a classifier C(Y,X) is the fraction p(Y, X) among the q(N' - N)
substrings in Y of length N, that should be declared to be acceptable by the classifier, but
are rejected by it.
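
The definition can be made concrete with a sliding-window count. In the Python sketch below, `is_similar` (the ground-truth similarity of a window to Y) and `accepts` (the classifier's decision) are hypothetical callables supplied by the user; neither is specified in the text.

```python
from typing import Callable


def empirical_error_rate(
    y: str,
    n: int,
    is_similar: Callable[[str], bool],
    accepts: Callable[[str], bool],
) -> tuple[float, float]:
    """Return (q, error_rate) over all length-n windows of y.

    q is the fraction of windows that should be accepted; error_rate is the
    fraction of those windows that the classifier rejects.
    """
    # There are len(y) - n + 1 windows; the text counts N' - N, which
    # differs by one and is immaterial when N' is much larger than N.
    windows = [y[i:i + n] for i in range(len(y) - n + 1)]
    should_accept = [w for w in windows if is_similar(w)]
    if not should_accept:
        raise ValueError("q = 0: no window is similar to y, error rate undefined")
    q = len(should_accept) / len(windows)
    rejected = sum(1 for w in should_accept if not accepts(w))
    return q, rejected / len(should_accept)


# Toy usage with made-up predicates, for illustration only.
q, p = empirical_error_rate(
    "AABAAB", 2,
    is_similar=lambda w: "A" in w,
    accepts=lambda w: w.startswith("A"),
)
print(q, p)
```

In the toy call, all five windows count as similar and only the window BA is rejected, so one window in five is misclassified.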

Theorem 1 Assume that the number of (unknown a-priori) features in Y is f(Y). Let $\epsilon$
be an arbitrarily small positive number and consider the compaction of Y onto a suffix-tree
with at most $\frac{N f(Y)}{\epsilon}$ leaves, which are the distinct substrings $Y_i(j)$; $1 \le j \le N$ in Y with an
empirical probability of appearance in Y that is at least $\frac{\epsilon}{N f(Y)}$.

Then, the error rate $P_C(Y)$ of C(Y,X) might be increased by the compaction by no more
than $\frac{\epsilon}{q}$, which vanishes with $\epsilon$.
Proof of Theorem 1:



Let F*(Y) consist of all the substrings $Y_i(j)$; $1 \le j \le N$ in Y, with an empirical probability
less than $\frac{\epsilon}{N f(Y)}$.

Then, the average number of instances in a substring of length N in Y at which some
element of F*(Y) appears is at most $N f(Y) \frac{\epsilon}{N f(Y)} = \epsilon$.

Therefore, by Chebyshev Inequality, the empirical probability of appearance of a suffix
of length N that contains one or more elements of F*(Y) is at most $\epsilon$, and the empirical
probability that such an N-suffix will be one of the q(N' - N) N-suffixes that should be
accepted, but might be rejected due to the compaction, is at most $\frac{\epsilon}{q}$.
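
The counting behind these bounds can be spelled out as follows. This is a sketch under stated assumptions, not the author's derivation: $Z_i$ is introduced here to denote the number of positions in the i-th length-N window of Y at which a rare feature appears, only the at most f(Y) feature-supporting substrings are counted, the rarity threshold is taken to be $\epsilon/(N f(Y))$, and boundary effects are neglected since N' is much larger than N. The inequality used is the first-moment (Markov) form of the Chebyshev inequality.

```latex
% Assumed notation: Z_i = number of positions in window i (of length N)
% at which one of the at most f(Y) rare features appears.
\begin{align*}
\frac{1}{N'-N}\sum_{i} Z_i
  &\le N \cdot f(Y) \cdot \frac{\epsilon}{N f(Y)} = \epsilon
  && \text{(average count per window)} \\
\frac{\#\{i : Z_i \ge 1\}}{N'-N}
  &\le \frac{1}{N'-N}\sum_{i} Z_i \le \epsilon
  && \text{(since } \mathbf{1}[Z_i \ge 1] \le Z_i\text{)} \\
\frac{\#\{i : Z_i \ge 1,\ \text{window } i \text{ should be accepted}\}}{q\,(N'-N)}
  &\le \frac{\epsilon\,(N'-N)}{q\,(N'-N)} = \frac{\epsilon}{q}
  && \text{(restricting to the acceptable windows)}
\end{align*}
```

The last line bounds the added error rate because only windows containing a discarded feature can change their classification after the compaction.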

Thus, as claimed above, it is possible to universally compact a very long training sequence
(N' >> N) onto a suffix-tree with no more than O(N) leaves that serves as an alternative
data base to Y, with only a negligible effect on the classifier's performance.
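
A simplified Python sketch of such a compaction is given below. It is an illustration under assumptions, not the construction used in the proof: contexts are read backwards from each position (as in the earlier example), `max_len` and `min_prob` are hypothetical parameters standing in for the maximal context length and the empirical-probability threshold, and a leaf is taken to be a retained context none of whose one-letter extensions is retained.

```python
from collections import Counter


def compact(y: str, max_len: int, min_prob: float) -> list[str]:
    """Return the leaves of the tree of frequent backward-read contexts of y.

    A context is retained when its empirical probability (count / len(y))
    is at least min_prob. Retained contexts are closed under prefixes, so
    they form a tree; its leaves have no retained one-letter extension.
    """
    counts: Counter[str] = Counter()
    for i in range(len(y)):
        # Backward-read context ending at position i, at most max_len letters.
        backward = y[max(0, i - max_len + 1): i + 1][::-1]
        for j in range(1, len(backward) + 1):
            counts[backward[:j]] += 1

    kept = {s for s, c in counts.items() if c / len(y) >= min_prob}
    alphabet = set(y)
    leaves = [s for s in kept if not any(s + a in kept for a in alphabet)]
    return sorted(leaves)


leaves = compact("ABACDCBEDEDE", max_len=3, min_prob=2 / 12)
print(leaves)  # ['A', 'B', 'C', 'DE', 'EDE']
```

Because the leaves are prefix-free, at most one of them can match at any position of Y, so their counts sum to at most len(y) and there can be no more than 1/min_prob of them. This is the counting argument that keeps the tree size independent of N'.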

Observe that a compaction of the training sequence Y onto a suffix-tree with O(N) leaves
is traditionally justified under a probabilistic HMM regime, by assuming that $N > 2^{H L_{max}}$,
where $L_{max}$ is the length of the longest feature-supporting suffix in Y, and H is the entropy of
the HMM process that generates Y. By the Asymptotic Equipartition Property (AEP)
of Information Theory, this yields a vanishing probability measure of F*(Y) as $L_{max}$ and N'
tend to infinity [e.g. Ulitsky I. et al., 2006], thus bridging the probabilistic approach with
the non-probabilistic one that is presented here.